There is a growing interest in developing unlearnable examples (UEs) against visual privacy leaks on the Internet. UEs are training samples added with invisible but unlearnable noise, which have been found can prevent unauthorized training of machine learning models. UEs typically are generated via a bilevel optimization framework with a surrogate model to remove (minimize) errors from the original samples, and then applied to protect the data against unknown target models. However, existing UE generation methods all rely on an ideal assumption called label-consistency, where the hackers and protectors are assumed to hold the same label for a given sample. In this work, we propose and promote a more practical label-agnostic setting, where the hackers may exploit the protected data quite differently from the protectors. E.g., a m-class unlearnable dataset held by the protector may be exploited by the hacker as a n-class dataset. Existing UE generation methods are rendered ineffective in this challenging setting. To tackle this challenge, we present a novel technique called Unlearnable Clusters (UCs) to generate label-agnostic unlearnable examples with cluster-wise perturbations. Furthermore, we propose to leverage VisionandLanguage Pre-trained Models (VLPMs) like CLIP as the surrogate model to improve the transferability of the crafted UCs to diverse domains. We empirically verify the effectiveness of our proposed approach under a variety of settings with different datasets, target models, and even commercial platforms Microsoft Azure and Baidu PaddlePaddle.
translated by 谷歌翻译
Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods ignored - the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on relevance tower due to such a correlation. We then propose three methods to mitigate the negative confounding effects by better disentangling relevance and bias. Empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches.
translated by 谷歌翻译
Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs). CoT explicitly encourages the LLM to generate intermediate rationales for solving a problem, by providing a series of reasoning steps in the demonstrations. Despite its success, there is still little understanding of what makes CoT prompting effective and which aspects of the demonstrated reasoning steps contribute to its performance. In this paper, we show that CoT reasoning is possible even with invalid demonstrations - prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference. Further experiments show that other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning. Overall, these findings both deepen our understanding of CoT prompting, and open up new questions regarding LLMs' capability to learn to reason in context.
translated by 谷歌翻译
Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. Recently, plenty of works are proposed, following a similar paradigm consisting of three essential components, i.e., camera feature extraction, BEV feature construction, and task heads. Among the three components, BEV feature construction is BEV-specific compared with 2D tasks. Existing methods aggregate the multi-view camera features to the flattened grid in order to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. For example, the barrier is located at a low height while the truck is located at a high height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build the global and local BEV slices. Then, the features of BEV slices are aggregated from the camera features and merged by the attention mechanism. Finally, we fuse the merged local and global BEV features by a transformer to generate the final feature map for task heads. The purpose of local BEV slices is to emphasize informative heights. In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released.
translated by 谷歌翻译
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
translated by 谷歌翻译
鉴于其广泛的应用,已经对人面部交换的任务进行了许多尝试。尽管现有的方法主要依赖于乏味的网络和损失设计,但它们仍然在源和目标面之间的信息平衡中挣扎,并倾向于产生可见的人工制品。在这项工作中,我们引入了一个名为StylesWap的简洁有效的框架。我们的核心想法是利用基于样式的生成器来增强高保真性和稳健的面部交换,因此可以采用发电机的优势来优化身份相似性。我们仅通过最小的修改来确定,StyleGAN2体系结构可以成功地处理来自源和目标的所需信息。此外,受到TORGB层的启发,进一步设计了交换驱动的面具分支以改善信息的融合。此外,可以采用stylegan倒置的优势。特别是,提出了交换引导的ID反转策略来优化身份相似性。广泛的实验验证了我们的框架会产生高质量的面部交换结果,从而超过了最先进的方法,既有定性和定量。
translated by 谷歌翻译
本文考虑了使用嵌入式设备来获取和分类图像的设置。由于计算能力有限,嵌入式设备依赖于具有不平衡精度的简约分类模型。当认为本地分类不准确时,设备可以决定使用更准确但资源密集型的模型将图像卸载到边缘服务器。但是,资源限制(例如,网络带宽)需要调节这种传输,以避免交通拥堵和高延迟。当传输调节是通过令牌桶时,该论文调查了此卸载问题,该机制通常用于此类目的。目的是设计一种轻巧的在线卸载策略,该策略在令牌存储桶的限制下优化了特定于应用程序的指标(例如,分类精度)。该论文制定了基于深Q网络(DQN)的政策,并证明了其功效和在嵌入式设备上部署的可行性。值得注意的是,该策略可以处理复杂的输入模式,包括图像到达中的相关性和分类精度。评估是通过使用来自Imagenet图像分类基准生成的合成痕迹对局部测试床进行图像分类进行的。这项工作的实施可在https://github.com/qiujiaming315/edgeml-dqn上获得。
translated by 谷歌翻译
在本文中,我们介绍了全景语义细分,该分段以整体方式提供了对周围环境的全景和密集的像素的理解。由于两个关键的挑战,全景分割尚未探索:(1)全景上的图像扭曲和对象变形; (2)缺乏培训全景分段的注释。为了解决这些问题,我们提出了一个用于全景语义细分(Trans4Pass)体系结构的变压器。首先,为了增强失真意识,Trans4Pass配备了可变形的贴片嵌入(DPE)和可变形的MLP(DMLP)模块,能够在适应之前(适应之前或之后)和任何地方(浅层或深度级别的(浅层或深度))和图像变形(通过任何涉及(浅层或深层))和图像变形(通过任何地方)和图像变形设计。我们进一步介绍了升级后的Trans4Pass+模型,其中包含具有平行令牌混合的DMLPV2,以提高建模歧视性线索的灵活性和概括性。其次,我们提出了一种无监督域适应性的相互典型适应(MPA)策略。第三,除了针孔到型 - 帕诺amic(PIN2PAN)适应外,我们还创建了一个新的数据集(Synpass),其中具有9,080个全景图像,以探索360 {\ deg} Imagery中的合成对真实(Syn2real)适应方案。进行了广泛的实验,这些实验涵盖室内和室外场景,并且使用PIN2PAN和SYN2REAL方案进行了研究。 Trans4Pass+在四个域自适应的全景语义分割基准上实现最先进的性能。代码可从https://github.com/jamycheung/trans4pass获得。
translated by 谷歌翻译
最近,深层神经网络(DNNS)用于减少带宽并提高互联网视频交付的质量。现有的方法训练服务器上每个视频块的相应内容超级分辨率(SR)模型,并将低分辨率(LR)视频块以及SR模型一起流到客户端。尽管他们取得了令人鼓舞的结果,但网络培训的巨大计算成本限制了其实际应用。在本文中,我们提出了一种名为有效元调整(EMT)的方法,以降低计算成本。 EMT没有从头开始训练,而是将元学习的模型适应了输入视频的第一部分。至于以下块,它通过以前的改编模型的梯度掩盖选择了部分参数。为了实现EMT的进一步加速,我们提出了一种新颖的抽样策略,以从视频帧中提取最具挑战性的补丁。拟议的策略高效,带来了可忽略的额外成本。我们的方法大大降低了计算成本并取得更好的性能,为将神经视频传递技术应用于实际应用铺平了道路。我们基于各种有效的SR架构进行了广泛的实验,包括ESPCN,SRCNN,FSRCNN和EDSR-1,证明了我们工作的概括能力。该代码通过\ url {https://github.com/neural-video-delivery/emt-pytorch-eccv2022}发布。
translated by 谷歌翻译
语言模型既展示了定量的改进,又展示了新的定性功能,随着规模的增加。尽管它们具有潜在的变革性影响,但这些新能力的特征却很差。为了为未来的研究提供信息,为破坏性的新模型能力做准备,并改善社会有害的效果,至关重要的是,我们必须了解目前和近乎未来的能力和语言模型的局限性。为了应对这一挑战,我们介绍了超越模仿游戏基准(Big Bench)。 Big Bench目前由204个任务组成,由132家机构的442位作者贡献。任务主题是多样的,从语言学,儿童发展,数学,常识性推理,生物学,物理学,社会偏见,软件开发等等。 Big-Bench专注于被认为超出当前语言模型的功能的任务。我们评估了OpenAI的GPT型号,Google内部密集变压器体系结构和大型基础上的开关稀疏变压器的行为,跨越了数百万到数十亿个参数。此外,一个人类专家评估者团队执行了所有任务,以提供强大的基准。研究结果包括:模型性能和校准都随规模改善,但绝对的术语(以及与评估者的性能相比);在模型类中的性能非常相似,尽管带有稀疏性。逐渐和预测的任务通常涉及大量知识或记忆成分,而在临界规模上表现出“突破性”行为的任务通常涉及多个步骤或组成部分或脆性指标;社交偏见通常会随着含糊不清的环境而随着规模而增加,但这可以通过提示来改善。
translated by 谷歌翻译